🎵 Machine Learning Operations (MLOps)


╔══════════════════════════════════════════════════════════════════╗
║                   🎼 TURKISH MUSIC EMOTION                       ║
║              Machine Learning Operations Project                 ║
╚══════════════════════════════════════════════════════════════════╝

👨‍🏫 Teaching Team

👔 Lead Professors

  • Dr. Gerardo Rodríguez Hernández
  • Maestro Ricardo Valdez Hernández

🎓 Support Team

  • Maestra María Mylen Treviño Elizondo
    Assistant Professor
  • M. en C. José Ángel Martínez Navarro
    Tutor Professor

🎯 Project Dataset

🎵 Turkish Music Emotion Dataset

  Class     Samples
  😊 Happy   100
  😢 Sad     100
  😠 Angry   100
  😌 Relax   100

👥 Development Team

David Cruz Beltrán
Student ID
🔧 Software Engineer
Data Pipeline & Versioning

Javier Augusto Rebull Saucedo
Student ID
⚙️ SRE / Data Engineer
DevOps & Infrastructure

Sandra Luz Cervantes Espinoza
Student ID
🤖 ML Engineer / Data Scientist
Model Development & Analysis

"Applying MLOps to emotion classification in Turkish music" 🎼✨


📋 Project Objectives

🎯 Analysis

📝 ML Canvas
💡 Value Proposition
🔍 Requirements

🛠️ Development

🧹 EDA & Cleaning
🔄 DVC Versioning
⚙️ Preprocessing

🤖 Modeling

🏗️ Model Building
🎛️ Hyperparameter Tuning
📊 Evaluation

🔧 Technology Stack

  Category          Tools
  💻 Language        Python
  📊 Data Analysis   Pandas, NumPy
  🤖 ML Framework    Scikit-Learn
  🔄 Versioning      DVC, Git
  📈 Visualization   Matplotlib, Seaborn

🎼 Emotion Classification

  Emotion   Musical Characteristics          Applications
  😊 Happy   Fast tempo, major key            Energetic playlists, marketing
  😢 Sad     Slow tempo, minor key            Music therapy, film
  😠 Angry   High intensity, dissonance       Content moderation, video games
  😌 Relax   Moderate tempo, soft harmonies   Meditation, ambient settings

🚀 MLOps Methodology

graph LR
    A[📥 Data Ingestion] --> B[🧹 Data Cleaning]
    B --> C[🔍 EDA]
    C --> D[⚙️ Feature Engineering]
    D --> E[🔄 Data Versioning - DVC]
    E --> F[🤖 Model Training]
    F --> G[🎛️ Hyperparameter Tuning]
    G --> H[📊 Model Evaluation]
    H --> I[📝 Documentation]

📈 Deliverables

  #  Component                        Status           Weight
  1️⃣ ML Canvas & Value Proposition   🔄 In Progress   15%
  2️⃣ EDA & Data Cleaning             🔄 In Progress   20%
  3️⃣ Feature Engineering             🔄 In Progress   15%
  4️⃣ Data Versioning (DVC)           🔄 In Progress   15%
  5️⃣ Model Development               🔄 In Progress   20%
  6️⃣ Evaluation & Documentation      🔄 In Progress   15%

🔗 Project Links

GitHub Colab Dataset


🎵 Turning musical emotions into knowledge through MLOps 🤖




📅 Due date: October 13, 2025 • 01:00 hrs
📧 Contact: via Canvas


Project developed as part of the Master's in Applied Artificial Intelligence
Instituto Tecnológico y de Estudios Superiores de Monterrey

Information:

  • Turkish Music Emotion dataset, file "Acoustic Features.csv".
In [1]:
#########################################
#   Data Handling and Analysis          #
#########################################
import numpy as np      # Numerical computing and array handling
import pandas as pd     # Tabular data manipulation and analysis (DataFrames)

#################################
#   Data Visualization          #
#################################
import matplotlib.pyplot as plt # Static plots and visualizations
import seaborn as sns           # Attractive statistical visualizations

###################################################
#   System Utilities and Math                     #
###################################################
import os                       # Operating system interaction
import math                     # Basic mathematical functions
from scipy import stats         # Advanced statistical functions

################################################
#   Machine Learning (Scikit-learn)            #
################################################

# --- Data Preprocessing and Transformation ---
from sklearn.preprocessing import LabelEncoder, StandardScaler  # Encode labels and standardize features
from sklearn.preprocessing import FunctionTransformer         # Apply custom functions to the data
from sklearn.preprocessing import PowerTransformer            # Apply power transformations to the data
from sklearn.decomposition import PCA                         # Dimensionality reduction
from sklearn.pipeline import Pipeline                         # Chain processing and modeling steps

# --- Model Selection and Evaluation ---
from sklearn.model_selection import train_test_split, cross_val_score # Split data and run cross-validation
from sklearn.model_selection import GridSearchCV                      # Hyperparameter search
from sklearn.metrics import accuracy_score, confusion_matrix      # Measure accuracy and inspect the confusion matrix
from sklearn.metrics import classification_report                 # Detailed per-class metrics report

# --- Classification Models ---
from sklearn.linear_model import LogisticRegression     # Logistic Regression model
from sklearn.neighbors import KNeighborsClassifier      # K-Nearest Neighbors model
from sklearn.svm import SVC                             # Support Vector Machines (SVM)
from sklearn.naive_bayes import GaussianNB              # Gaussian Naive Bayes classifier
from sklearn.tree import DecisionTreeClassifier         # Decision Tree model
from sklearn.ensemble import RandomForestClassifier     # Random Forest model
from sklearn.neural_network import MLPClassifier        # Neural Network (Multilayer Perceptron)

#################################################
#   Project-Specific Libraries 📦               #
#################################################
from acoustic_ml.dataset import load_raw_data       # Load the project's raw data
from acoustic_ml.features import create_features    # Create new features for the model
from acoustic_ml.modeling.predict import load_model, predict # Load and use a trained model
In [11]:
from acoustic_ml.dataset import load_raw_data                     # Import the required function from the module

df = load_raw_data()                                              # Run the function to load the data
print(f"✓ Dataset loaded: {df.shape[0]} rows and {df.shape[1]} columns") # Verify and report the dataset's dimensions
✓ Dataset loaded: 400 rows and 51 columns

📊 Data Analysis Process


🔍 Exploratory Data Analysis (EDA)

  • Descriptive analysis → statistics and data summaries
  • Numeric variable analysis → distributions and trends
  • Text variable analysis → frequencies and patterns
  • Correlation analysis → bivariate and multivariate relationships
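The correlation step can be sketched with `DataFrame.corr` alone; the frame below is synthetic (hypothetical `tempo`, `energy`, and `noise` columns, not the project's acoustic features):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
tempo = rng.normal(120, 30, 200)
energy = 0.5 * tempo + rng.normal(0, 5, 200)  # correlated with tempo by construction
noise = rng.normal(0, 1, 200)                 # independent column

df_demo = pd.DataFrame({"tempo": tempo, "energy": energy, "noise": noise})

# Pairwise Pearson correlations; values near ±1 flag strongly related features
corr = df_demo.corr(numeric_only=True)
print(corr.round(2))
```

A heatmap of this matrix (e.g. `sns.heatmap(corr)`) is the usual next step when there are 50 features rather than 3.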

🧹 Preprocessing

  • Missing values → identification and treatment
  • Outliers → detection and handling
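One common way to carry out the outlier step is Tukey's 1.5×IQR rule; a minimal sketch on a synthetic column with two injected outliers (not project data):

```python
import pandas as pd

# Synthetic feature column; 0.90 and -0.50 are deliberate outliers
serie = pd.Series([0.10, 0.12, 0.13, 0.11, 0.14, 0.12, 0.90, -0.50])

q1, q3 = serie.quantile(0.25), serie.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # Tukey fences

outliers = serie[(serie < lower) | (serie > upper)]
print(f"Bounds: [{lower:.3f}, {upper:.3f}] -> {len(outliers)} outlier(s)")
```

Whether flagged points are dropped, capped, or kept depends on the feature; for audio descriptors, extreme values are often genuine.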

⚙️ Feature Engineering

Creating and transforming variables to improve the model
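As a hedged illustration of this step (the project's own logic lives in `acoustic_ml.features.create_features` and is not shown here), two common derived features, computed on hypothetical values of two real column names:

```python
import numpy as np
import pandas as pd

# Hypothetical values; the real frame comes from load_raw_data()
df_fe = pd.DataFrame({
    "_RMSenergy_Mean": [0.10, 0.20, 0.15],
    "_Lowenergy_Mean": [0.50, 0.55, 0.60],
})

# Ratio feature: RMS energy relative to the low-energy proportion
df_fe["energy_ratio"] = df_fe["_RMSenergy_Mean"] / df_fe["_Lowenergy_Mean"]

# Log transform of a skewed feature (log1p is safe at zero)
df_fe["rms_log"] = np.log1p(df_fe["_RMSenergy_Mean"])
print(df_fe.round(3))
```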


🚀 Model Training and Evaluation

Development, validation, and optimization of the predictive model
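A minimal sketch of this stage using the scikit-learn tools imported earlier, on synthetic data shaped like the dataset (400 samples, 50 features, 4 classes); the model and hyperparameters are placeholders, not the project's final choices:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in mirroring the dataset's shape
X, y = make_classification(n_samples=400, n_features=50, n_informative=10,
                           n_classes=4, random_state=42)

# Stratified split keeps the 4 classes balanced in both partitions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Scaling inside the pipeline avoids leaking test-set statistics into training
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipe, X_train, y_train, cv=5)
pipe.fit(X_train, y_train)
print(f"CV accuracy: {scores.mean():.3f} | test accuracy: {pipe.score(X_test, y_test):.3f}")
```

The same pattern extends to `GridSearchCV` over the pipeline for the hyperparameter-tuning deliverable.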

🔬 1. Exploratory Data Analysis (EDA)

📊 Descriptive Statistics

Key measures: Mean · Median · Standard deviation · Quartiles · Skewness · Kurtosis

Exploring the structure, distribution, and quality of the data
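All of these measures are available directly in pandas; a right-skewed synthetic sample stands in for a real feature column here:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Right-skewed synthetic sample (lognormal), standing in for a feature column
x = pd.Series(rng.lognormal(mean=0.0, sigma=0.5, size=1000))

resumen = {
    "mean": x.mean(),
    "median": x.median(),
    "std": x.std(),
    "q1": x.quantile(0.25),
    "q3": x.quantile(0.75),
    "skew": x.skew(),       # > 0 for a right-skewed distribution
    "kurtosis": x.kurt(),   # excess kurtosis (0 for a normal distribution)
}
print({k: round(v, 3) for k, v in resumen.items()})
```

Note that for a right-skewed sample the mean exceeds the median, which is exactly the asymmetry the skewness statistic quantifies.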

In [14]:
def info_df(df):
    """
    Builds a styled summary DataFrame with key information for each column.
    """
    print(f"DataFrame analysis: {df.shape[0]} rows and {df.shape[1]} columns")
    
    # 1. Build the summary DataFrame
    resumen = pd.DataFrame({
        'Data Type': df.dtypes,
        'Non-Null Values': df.count(),
        'Null Values': df.isnull().sum(),
        '% Nulls': round(df.isnull().sum() * 100 / len(df), 2),
        'Unique Values': df.nunique(),
        'Cardinality (%)': round(df.nunique() * 100 / len(df), 2)
    })

    # 2. Sort by null percentage to prioritize problem columns
    resumen = resumen.sort_values(by='% Nulls', ascending=False)

    # 3. Apply visual styling
    styled_resumen = (resumen.style
                      .background_gradient(cmap='viridis', subset=['% Nulls'])
                      .background_gradient(cmap='plasma_r', subset=['Cardinality (%)'])
                      .format({'% Nulls': '{:.2f}%', 'Cardinality (%)': '{:.2f}%'})
                      .bar(subset=['Non-Null Values'], color='#5fba7d')
                      .set_caption("Analytical Summary of the Dataset")
                     )
    
    return styled_resumen

# Call the function on our DataFrame!
info_df(df)
DataFrame analysis: 400 rows and 51 columns
Out[14]:
Analytical Summary of the Dataset
  Data Type Non-Null Values Null Values % Nulls Unique Values Cardinality (%)
Class object 400 0 0.00% 4 1.00%
_Chromagram_Mean_6 float64 400 0 0.00% 241 60.25%
_Spectralspread_Mean float64 400 0 0.00% 388 97.00%
_Spectralskewness_Mean float64 400 0 0.00% 357 89.25%
_Spectralkurtosis_Mean float64 400 0 0.00% 381 95.25%
_Spectralflatness_Mean float64 400 0 0.00% 94 23.50%
_EntropyofSpectrum_Mean float64 400 0 0.00% 134 33.50%
_Chromagram_Mean_1 float64 400 0 0.00% 259 64.75%
_Chromagram_Mean_2 float64 400 0 0.00% 240 60.00%
_Chromagram_Mean_3 float64 400 0 0.00% 263 65.75%
_Chromagram_Mean_4 float64 400 0 0.00% 240 60.00%
_Chromagram_Mean_5 float64 400 0 0.00% 286 71.50%
_Chromagram_Mean_7 float64 400 0 0.00% 243 60.75%
_Brightness_Mean float64 400 0 0.00% 277 69.25%
_Chromagram_Mean_8 float64 400 0 0.00% 269 67.25%
_Chromagram_Mean_9 float64 400 0 0.00% 262 65.50%
_Chromagram_Mean_10 float64 400 0 0.00% 245 61.25%
_Chromagram_Mean_11 float64 400 0 0.00% 273 68.25%
_Chromagram_Mean_12 float64 400 0 0.00% 252 63.00%
_HarmonicChangeDetectionFunction_Mean float64 400 0 0.00% 178 44.50%
_HarmonicChangeDetectionFunction_Std float64 400 0 0.00% 159 39.75%
_HarmonicChangeDetectionFunction_Slope float64 400 0 0.00% 237 59.25%
_HarmonicChangeDetectionFunction_PeriodFreq float64 400 0 0.00% 40 10.00%
_HarmonicChangeDetectionFunction_PeriodAmp float64 400 0 0.00% 196 49.00%
_Spectralcentroid_Mean float64 400 0 0.00% 388 97.00%
_Pulseclarity_Mean float64 400 0 0.00% 266 66.50%
_RMSenergy_Mean float64 400 0 0.00% 196 49.00%
_MFCC_Mean_8 float64 400 0 0.00% 273 68.25%
_Lowenergy_Mean float64 400 0 0.00% 166 41.50%
_Fluctuation_Mean float64 400 0 0.00% 377 94.25%
_Tempo_Mean float64 400 0 0.00% 388 97.00%
_MFCC_Mean_1 float64 400 0 0.00% 354 88.50%
_MFCC_Mean_2 float64 400 0 0.00% 347 86.75%
_MFCC_Mean_3 float64 400 0 0.00% 319 79.75%
_MFCC_Mean_4 float64 400 0 0.00% 316 79.00%
_MFCC_Mean_5 float64 400 0 0.00% 297 74.25%
_MFCC_Mean_6 float64 400 0 0.00% 297 74.25%
_MFCC_Mean_7 float64 400 0 0.00% 304 76.00%
_MFCC_Mean_9 float64 400 0 0.00% 278 69.50%
_Eventdensity_Mean float64 400 0 0.00% 163 40.75%
_MFCC_Mean_10 float64 400 0 0.00% 271 67.75%
_MFCC_Mean_11 float64 400 0 0.00% 253 63.25%
_MFCC_Mean_12 float64 400 0 0.00% 272 68.00%
_MFCC_Mean_13 float64 400 0 0.00% 259 64.75%
_Roughness_Mean float64 400 0 0.00% 388 97.00%
_Roughness_Slope float64 400 0 0.00% 292 73.00%
_Zero-crossingrate_Mean float64 400 0 0.00% 388 97.00%
_AttackTime_Mean float64 400 0 0.00% 61 15.25%
_AttackTime_Slope float64 400 0 0.00% 274 68.50%
_Rolloff_Mean float64 400 0 0.00% 388 97.00%
_HarmonicChangeDetectionFunction_PeriodEntropy float64 400 0 0.00% 26 6.50%
In [13]:
estadisticas_descriptivas = df.describe()
estadisticas_descriptivas
Out[13]:
_RMSenergy_Mean _Lowenergy_Mean _Fluctuation_Mean _Tempo_Mean _MFCC_Mean_1 _MFCC_Mean_2 _MFCC_Mean_3 _MFCC_Mean_4 _MFCC_Mean_5 _MFCC_Mean_6 ... _Chromagram_Mean_9 _Chromagram_Mean_10 _Chromagram_Mean_11 _Chromagram_Mean_12 _HarmonicChangeDetectionFunction_Mean _HarmonicChangeDetectionFunction_Std _HarmonicChangeDetectionFunction_Slope _HarmonicChangeDetectionFunction_PeriodFreq _HarmonicChangeDetectionFunction_PeriodAmp _HarmonicChangeDetectionFunction_PeriodEntropy
count 400.000000 400.000000 400.000000 400.000000 400.000000 400.000000 400.000000 400.000000 400.000000 400.000000 ... 400.000000 400.000000 400.000000 400.000000 400.000000 400.000000 400.000000 400.000000 400.000000 400.000000
mean 0.134650 0.553605 7.145932 123.682020 2.456422 0.071890 0.488065 0.030465 0.178897 0.038307 ... 0.354632 0.590975 0.342340 0.385620 0.328213 0.192997 -0.000157 1.762288 0.769690 0.966712
std 0.064368 0.050750 2.280145 34.234344 0.799262 0.537865 0.294607 0.275839 0.195230 0.203754 ... 0.334976 0.357981 0.315808 0.348117 0.055520 0.047092 0.104743 0.930352 0.072107 0.003841
min 0.010000 0.302000 3.580000 48.284000 0.323000 -3.484000 -0.870000 -1.636000 -0.494000 -0.916000 ... 0.000000 0.000000 0.000000 0.000000 0.112000 0.060000 -0.285000 0.187000 0.530000 0.939000
25% 0.085000 0.523000 5.859500 101.490250 1.948500 -0.262750 0.281250 -0.117000 0.061250 -0.078250 ... 0.066750 0.264500 0.059500 0.060750 0.290750 0.160000 -0.058000 0.961000 0.725000 0.965000
50% 0.128000 0.553000 6.734000 120.132500 2.389500 0.068500 0.464500 0.044500 0.181000 0.049500 ... 0.247000 0.612000 0.247000 0.296500 0.333000 0.190000 -0.002000 1.682000 0.786000 0.967000
75% 0.174000 0.583250 7.823500 148.986250 2.860250 0.413250 0.686000 0.198250 0.288500 0.151250 ... 0.612000 1.000000 0.565250 0.670750 0.367250 0.226000 0.063250 2.243000 0.824000 0.969000
max 0.431000 0.703000 23.475000 195.026000 5.996000 1.937000 1.622000 1.126000 1.055000 0.799000 ... 1.000000 1.000000 1.000000 1.000000 0.488000 0.340000 0.442000 4.486000 0.908000 0.977000

8 rows × 50 columns

In [16]:
def descripcion_df(df):
    """
    Builds a complete, styled descriptive-statistics table 
    for the numeric columns.
    """
    # Select numeric columns only
    df_num = df.select_dtypes(include=['number'])
    
    if df_num.empty:
        print("No numeric columns were found in the DataFrame.")
        return

    # Compute basic statistics
    desc = df_num.describe().T
    
    # Add advanced statistics
    desc['skew'] = df_num.skew()
    desc['kurtosis'] = df_num.kurt()
    desc['median'] = df_num.median()
    
    # Reorder so the median sits next to the mean
    desc = desc[['count', 'mean', 'median', 'std', 'min', '25%', '50%', '75%', 'max', 'skew', 'kurtosis']]
    
    print(f"Descriptive statistics for {df_num.shape[1]} numeric columns:")
    
    # Apply styling to highlight important values
    styled_desc = (desc.style
                   .background_gradient(cmap='coolwarm', subset=['skew', 'kurtosis'])
                   .format('{:.2f}')
                   .set_caption("Advanced Descriptive Statistics")
                  )
    return styled_desc

# Call the function on our DataFrame!
descripcion_df(df)
Descriptive statistics for 50 numeric columns:
Out[16]:
Advanced Descriptive Statistics
  count mean median std min 25% 50% 75% max skew kurtosis
_RMSenergy_Mean 400.00 0.13 0.13 0.06 0.01 0.09 0.13 0.17 0.43 0.71 0.62
_Lowenergy_Mean 400.00 0.55 0.55 0.05 0.30 0.52 0.55 0.58 0.70 -0.39 2.17
_Fluctuation_Mean 400.00 7.15 6.73 2.28 3.58 5.86 6.73 7.82 23.48 2.89 12.73
_Tempo_Mean 400.00 123.68 120.13 34.23 48.28 101.49 120.13 148.99 195.03 0.12 -0.63
_MFCC_Mean_1 400.00 2.46 2.39 0.80 0.32 1.95 2.39 2.86 6.00 0.87 1.95
_MFCC_Mean_2 400.00 0.07 0.07 0.54 -3.48 -0.26 0.07 0.41 1.94 -0.68 4.91
_MFCC_Mean_3 400.00 0.49 0.46 0.29 -0.87 0.28 0.46 0.69 1.62 0.11 1.19
_MFCC_Mean_4 400.00 0.03 0.04 0.28 -1.64 -0.12 0.04 0.20 1.13 -0.70 4.08
_MFCC_Mean_5 400.00 0.18 0.18 0.20 -0.49 0.06 0.18 0.29 1.05 0.08 1.19
_MFCC_Mean_6 400.00 0.04 0.05 0.20 -0.92 -0.08 0.05 0.15 0.80 -0.28 2.58
_MFCC_Mean_7 400.00 0.06 0.07 0.18 -0.94 -0.04 0.07 0.17 0.57 -0.79 3.35
_MFCC_Mean_8 400.00 0.04 0.04 0.17 -0.74 -0.05 0.04 0.13 0.73 -0.11 2.86
_MFCC_Mean_9 400.00 0.02 0.02 0.16 -0.62 -0.07 0.02 0.12 0.54 -0.13 1.29
_MFCC_Mean_10 400.00 0.03 0.03 0.15 -0.54 -0.06 0.03 0.13 0.51 -0.25 1.08
_MFCC_Mean_11 400.00 0.03 0.04 0.14 -0.49 -0.04 0.04 0.11 0.49 -0.30 0.91
_MFCC_Mean_12 400.00 0.02 0.02 0.13 -0.42 -0.06 0.02 0.09 0.35 -0.35 0.40
_MFCC_Mean_13 400.00 0.02 0.04 0.13 -0.62 -0.05 0.04 0.10 0.54 -0.49 2.02
_Roughness_Mean 400.00 527.68 367.58 521.22 0.94 169.19 367.58 734.37 3899.85 2.08 6.73
_Roughness_Slope 400.00 0.07 0.07 0.17 -0.53 -0.03 0.07 0.17 0.58 -0.02 0.42
_Zero-crossingrate_Mean 400.00 997.25 893.49 524.90 149.49 592.27 893.49 1303.49 3147.91 0.93 0.69
_AttackTime_Mean 400.00 0.03 0.03 0.02 0.01 0.02 0.03 0.03 0.17 3.38 17.11
_AttackTime_Slope 400.00 -0.00 0.01 0.15 -0.47 -0.09 0.01 0.09 0.60 -0.01 0.47
_Rolloff_Mean 400.00 5691.07 5648.63 2293.40 887.15 3933.55 5648.63 7355.89 11508.30 0.10 -0.61
_Eventdensity_Mean 400.00 2.78 2.77 1.33 0.23 1.74 2.77 3.69 7.95 0.48 0.13
_Pulseclarity_Mean 400.00 0.25 0.22 0.16 0.01 0.13 0.22 0.33 0.86 1.16 1.21
_Brightness_Mean 400.00 0.43 0.45 0.13 0.05 0.35 0.45 0.53 0.74 -0.40 -0.07
_Spectralcentroid_Mean 400.00 2581.17 2547.68 863.52 606.52 1981.56 2547.68 3182.57 5326.38 0.23 -0.13
_Spectralspread_Mean 400.00 3082.39 3150.95 767.65 814.82 2506.77 3150.95 3684.33 4721.48 -0.26 -0.43
_Spectralskewness_Mean 400.00 1.87 1.69 0.88 0.39 1.33 1.69 2.18 7.86 2.28 9.12
_Spectralkurtosis_Mean 400.00 7.35 5.22 8.62 1.93 3.88 5.22 7.85 122.00 7.97 89.68
_Spectralflatness_Mean 400.00 0.05 0.05 0.03 0.01 0.03 0.05 0.06 0.21 1.49 5.17
_EntropyofSpectrum_Mean 400.00 0.87 0.88 0.04 0.74 0.85 0.88 0.90 0.94 -0.98 0.96
_Chromagram_Mean_1 400.00 0.35 0.27 0.32 0.00 0.06 0.27 0.55 1.00 0.70 -0.72
_Chromagram_Mean_2 400.00 0.25 0.14 0.29 0.00 0.02 0.14 0.40 1.00 1.21 0.43
_Chromagram_Mean_3 400.00 0.37 0.29 0.32 0.00 0.08 0.29 0.58 1.00 0.69 -0.73
_Chromagram_Mean_4 400.00 0.21 0.10 0.25 0.00 0.02 0.10 0.32 1.00 1.49 1.48
_Chromagram_Mean_5 400.00 0.35 0.27 0.30 0.00 0.09 0.27 0.54 1.00 0.78 -0.46
_Chromagram_Mean_6 400.00 0.26 0.14 0.29 0.00 0.02 0.14 0.45 1.00 1.11 0.19
_Chromagram_Mean_7 400.00 0.24 0.14 0.28 0.00 0.03 0.14 0.36 1.00 1.33 0.93
_Chromagram_Mean_8 400.00 0.39 0.30 0.33 0.00 0.10 0.30 0.64 1.00 0.57 -0.96
_Chromagram_Mean_9 400.00 0.35 0.25 0.33 0.00 0.07 0.25 0.61 1.00 0.76 -0.77
_Chromagram_Mean_10 400.00 0.59 0.61 0.36 0.00 0.26 0.61 1.00 1.00 -0.22 -1.44
_Chromagram_Mean_11 400.00 0.34 0.25 0.32 0.00 0.06 0.25 0.57 1.00 0.71 -0.69
_Chromagram_Mean_12 400.00 0.39 0.30 0.35 0.00 0.06 0.30 0.67 1.00 0.55 -1.12
_HarmonicChangeDetectionFunction_Mean 400.00 0.33 0.33 0.06 0.11 0.29 0.33 0.37 0.49 -0.47 0.49
_HarmonicChangeDetectionFunction_Std 400.00 0.19 0.19 0.05 0.06 0.16 0.19 0.23 0.34 0.25 -0.11
_HarmonicChangeDetectionFunction_Slope 400.00 -0.00 -0.00 0.10 -0.28 -0.06 -0.00 0.06 0.44 0.20 0.93
_HarmonicChangeDetectionFunction_PeriodFreq 400.00 1.76 1.68 0.93 0.19 0.96 1.68 2.24 4.49 0.40 -0.29
_HarmonicChangeDetectionFunction_PeriodAmp 400.00 0.77 0.79 0.07 0.53 0.72 0.79 0.82 0.91 -0.76 0.18
_HarmonicChangeDetectionFunction_PeriodEntropy 400.00 0.97 0.97 0.00 0.94 0.96 0.97 0.97 0.98 -1.48 7.57
In [30]:
def visualizar_distribuciones_numericas(df):
    """
    Draws a histogram and a box plot for each numeric column in the DataFrame,
    with three-line titles and a color palette inspired by Turkish music.
    """
    # Select numeric columns only
    df_num = df.select_dtypes(include=['number'])
    
    if df_num.empty:
        print("No numeric columns were found to visualize.")
        return
        
    print("Visualization of Numeric Distributions:")
    
    # Color palette inspired by Turkish music
    colores_turcos = ["#E9A000", "#1C4E80", "#A52A2A", "#8B4513", "#4A8C80", "#D36E1B", "#663399"]
    
    for i, col in enumerate(df_num.columns):
        # figsize enlarged for more room
        fig, axes = plt.subplots(1, 2, figsize=(10.5, 3.75))
        
        color_base = colores_turcos[i % len(colores_turcos)]

        # --- Histogram with density line (KDE) ---
        sns.histplot(df[col], kde=True, ax=axes[0], bins=30, color=color_base, edgecolor='white')
        
        # Three-line title
        axes[0].set_title(f'Distribution of Variable\n"{col}"\n(Histogram and KDE)', fontsize=10)
        
        mean_val = df[col].mean()
        median_val = df[col].median()
        axes[0].axvline(mean_val, color='#C0C0C0', linestyle='--', linewidth=2, label=f'Mean: {mean_val:.2f}')
        axes[0].axvline(median_val, color='#FFD700', linestyle='-', linewidth=2, label=f'Median: {median_val:.2f}')
        axes[0].legend(fontsize='small')
        axes[0].set_xlabel(col, fontsize=8)
        axes[0].set_ylabel('Frequency', fontsize=8)
        
        # --- Box plot ---
        sns.boxplot(x=df[col], ax=axes[1], color=color_base, flierprops=dict(markerfacecolor='#B22222', marker='D'))
        
        # Three-line title
        axes[1].set_title(f'Dispersion Analysis\n"{col}"\n(Box Plot and Outliers)', fontsize=10)
        axes[1].set_xlabel(col, fontsize=8)
        
        # Extra padding gives the titles more breathing room
        plt.tight_layout(pad=2.0)
        plt.show()

# --- Call the function to see the plots ---
visualizar_distribuciones_numericas(df)
Visualization of Numeric Distributions:
[Figure output: a histogram (with KDE, mean, and median lines) and a box plot for each of the 50 numeric variables]

Unique values per variable, to identify possible categorical variables

In [31]:
def analizar_cardinalidad(df):
    """
    Analyzes and displays the cardinality (unique values) of each column 
    in a visually striking way.
    """
    # Build a DataFrame with unique-value counts and percentages
    cardinalidad_df = pd.DataFrame({
        'Unique Values': df.nunique(),
        'Cardinality (%)': round(df.nunique() * 100 / len(df), 2)
    }).sort_values(by='Cardinality (%)', ascending=False)

    print(f"Cardinality analysis for {len(df.columns)} columns:")

    # Apply styling for a quick visual read
    styled_df = (cardinalidad_df.style
                 .background_gradient(cmap='magma', subset=['Cardinality (%)'])
                 .format({'Cardinality (%)': '{:.2f}%'})
                 .bar(subset=['Unique Values'], color='#5fba7d', align='zero')
                 .set_caption("Uniqueness and Cardinality of the Columns")
                )
    return styled_df

# --- Call the function for a cardinality-focused analysis ---
analizar_cardinalidad(df)
Cardinality analysis for 51 columns:
Out[31]:
Uniqueness and Cardinality of the Columns
  Unique Values Cardinality (%)
_Spectralcentroid_Mean 388 97.00%
_Tempo_Mean 388 97.00%
_Roughness_Mean 388 97.00%
_Spectralspread_Mean 388 97.00%
_Rolloff_Mean 388 97.00%
_Zero-crossingrate_Mean 388 97.00%
_Spectralkurtosis_Mean 381 95.25%
_Fluctuation_Mean 377 94.25%
_Spectralskewness_Mean 357 89.25%
_MFCC_Mean_1 354 88.50%
_MFCC_Mean_2 347 86.75%
_MFCC_Mean_3 319 79.75%
_MFCC_Mean_4 316 79.00%
_MFCC_Mean_7 304 76.00%
_MFCC_Mean_5 297 74.25%
_MFCC_Mean_6 297 74.25%
_Roughness_Slope 292 73.00%
_Chromagram_Mean_5 286 71.50%
_MFCC_Mean_9 278 69.50%
_Brightness_Mean 277 69.25%
_AttackTime_Slope 274 68.50%
_Chromagram_Mean_11 273 68.25%
_MFCC_Mean_8 273 68.25%
_MFCC_Mean_12 272 68.00%
_MFCC_Mean_10 271 67.75%
_Chromagram_Mean_8 269 67.25%
_Pulseclarity_Mean 266 66.50%
_Chromagram_Mean_3 263 65.75%
_Chromagram_Mean_9 262 65.50%
_MFCC_Mean_13 259 64.75%
_Chromagram_Mean_1 259 64.75%
_MFCC_Mean_11 253 63.25%
_Chromagram_Mean_12 252 63.00%
_Chromagram_Mean_10 245 61.25%
_Chromagram_Mean_7 243 60.75%
_Chromagram_Mean_6 241 60.25%
_Chromagram_Mean_2 240 60.00%
_Chromagram_Mean_4 240 60.00%
_HarmonicChangeDetectionFunction_Slope 237 59.25%
_RMSenergy_Mean 196 49.00%
_HarmonicChangeDetectionFunction_PeriodAmp 196 49.00%
_HarmonicChangeDetectionFunction_Mean 178 44.50%
_Lowenergy_Mean 166 41.50%
_Eventdensity_Mean 163 40.75%
_HarmonicChangeDetectionFunction_Std 159 39.75%
_EntropyofSpectrum_Mean 134 33.50%
_Spectralflatness_Mean 94 23.50%
_AttackTime_Mean 61 15.25%
_HarmonicChangeDetectionFunction_PeriodFreq 40 10.00%
_HarmonicChangeDetectionFunction_PeriodEntropy 26 6.50%
Class 4 1.00%

Searching for missing values

In [32]:
def analizar_datos_faltantes(df):
    """
    Analyzes and displays missing data in a striking way, showing
    only the columns that contain null values.
    """
    # Count and percentage of nulls per column
    valores_faltantes = df.isnull().sum()
    porcentaje_faltante = round(valores_faltantes * 100 / len(df), 2)
    
    # Build a DataFrame with the results
    resumen_faltantes = pd.DataFrame({
        'Missing Values': valores_faltantes,
        'Percentage (%)': porcentaje_faltante
    })
    
    # Keep only columns with missing data, sorted from highest to lowest
    resumen_faltantes = resumen_faltantes[resumen_faltantes['Missing Values'] > 0].sort_values(
        by='Percentage (%)', ascending=False
    )
    
    # If the resulting DataFrame is empty, congratulations!
    if resumen_faltantes.empty:
        print("🎉 Excellent! No missing data was found in the DataFrame.")
        return
        
    print(f"Missing-data analysis: {len(resumen_faltantes)} of {len(df.columns)} columns have null values.")
    
    # Apply styling to highlight problem columns
    styled_resumen = (resumen_faltantes.style
                      .background_gradient(cmap='Reds', subset=['Percentage (%)'])
                      .format({'Percentage (%)': '{:.2f}%'})
                      .bar(subset=['Missing Values'], color='#d65f5f', align='zero')
                      .set_caption("Columns with Missing Data")
                     )
    
    return styled_resumen

# --- Call the function for a precise missing-data diagnosis ---
analizar_datos_faltantes(df)
🎉 Excellent! No missing data was found in the DataFrame.

Bar chart showing the frequency of each emotion class

In [44]:
def countplot_sentimientos(df, col='Class'):
    """
    Count plot with adjusted X-axis margins
    and the title placed closer to the plot.
    """
    # --- 1. Data Preparation and Softened Color Palette ---
    colores_suaves = ["#C89F9C", "#A2D4AB", "#F0E2B6", "#B4A2D4"]
    data = df[col].value_counts()
    order = data.index
    total = len(df[col])
    palette = colores_suaves[:len(order)]

    # --- 2. Figure Creation with Dark Theme ---
    fig, ax = plt.subplots(figsize=(12, 7))
    fig.patch.set_facecolor('#1E1E1E')
    ax.set_facecolor('#1E1E1E')

    # --- 3. Gradient-Filled Bars ---
    for i, (category, count) in enumerate(data.items()):
        base_color = palette[i]
        gradient = np.linspace(0.8, 1.0, 256).reshape(-1, 1)
        rgb_color = np.array([int(base_color.lstrip('#')[j:j+2], 16) / 255.0 for j in (0, 2, 4)])
        gradient_colors = plt.cm.colors.ListedColormap(rgb_color * gradient)
        ax.imshow(np.arange(256).reshape(-1, 1), cmap=gradient_colors, 
                  extent=[i - 0.4, i + 0.4, 0, count], aspect='auto', interpolation='bilinear')

    # --- 4. High-Impact Annotations and Typography ---
    for i, count in enumerate(data):
        percentage = f'{(count / total) * 100:.1f}%'
        ax.text(i, count - (0.05 * data.max()), percentage, 
                ha='center', va='center', color='white', fontsize=14, weight='bold',
                bbox=dict(facecolor='black', alpha=0.3, boxstyle='round,pad=0.2', edgecolor='none'))

    # --- 5. Final Cleanup and Styling ---
    ax.set_xticks(range(len(order)))
    ax.set_xticklabels(order, fontsize=12, color='white', weight='medium')
    ax.set_ylim(0, data.max() * 1.1)
    
    ax.set_xlim(-0.5, len(order) - 0.5)
    
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.spines['left'].set_visible(False)
    ax.spines['bottom'].set_color('#444444')
    ax.tick_params(axis='x', colors='white', length=0)
    ax.set_yticks([])
    ax.set_ylabel('')
    ax.set_xlabel('')
    
    plt.title('Class Distribution', fontsize=20, color='#DDDDDD', weight='bold', pad=1)
    plt.show()

# --- Call the adjusted function ---
countplot_sentimientos(df)
[Figure output: class distribution bar chart with percentage labels]

Violin plots for each numeric feature, to see how its distribution varies across the emotion classes

In [50]:
def visualizar_violines(df, categoria_col='Class'):
    """
    Crea una cuadrícula de gráficos de violín
    """
    df_num = df.select_dtypes(include=np.number)
    if 'Track ID' in df_num.columns:
        df_num = df_num.drop(columns=['Track ID'])

    if categoria_col not in df.columns:
        print(f"Error: La columna '{categoria_col}' no se encuentra en el DataFrame.")
        return

    # Paleta de colores y ajuste dinámico al número de categorías
    n_categorias = df[categoria_col].nunique()
    colores_suaves = ["#C89F9C", "#A2D4AB", "#F0E2B6", "#B4A2D4"]
    palette_ajustada = colores_suaves[:n_categorias]
    
    n_variables = len(df_num.columns)
    n_cols_grid = 3
    n_rows_grid = (n_variables + n_cols_grid - 1) // n_cols_grid

    # Crear la figura con tema oscuro
    fig, axes = plt.subplots(n_rows_grid, n_cols_grid, figsize=(n_cols_grid * 5.5, n_rows_grid * 4.5))
    fig.patch.set_facecolor('#1E1E1E')
    
    # AQUÍ ESTÁ EL CAMBIO: Título ajustado para estar más cerca
    fig.suptitle('Análisis de Distribución por Clase', fontsize=22, color='white', weight='bold', y=0.99)

    axes = axes.flatten()

    for i, col in enumerate(df_num.columns):
        ax = axes[i]
        ax.set_facecolor('#1E1E1E')

        # hue set to the category column with legend=False, so the palette applies per class without a redundant legend
        sns.violinplot(
            x=categoria_col, y=col, data=df, ax=ax,
            palette=palette_ajustada, inner='box', linewidth=1.5,
            hue=categoria_col, legend=False
        )

        ax.set_title(col.replace('_', ' ').capitalize(), fontsize=14, color='white', weight='bold', pad=12)
        ax.set_xlabel('')
        ax.set_ylabel('')
        ax.tick_params(axis='x', colors='white', labelrotation=0)
        ax.tick_params(axis='y', colors='white')
        
        ax.spines['top'].set_visible(False)
        ax.spines['right'].set_visible(False)
        ax.spines['left'].set_color('#444444')
        ax.spines['bottom'].set_color('#444444')
        ax.grid(axis='y', linestyle='--', alpha=0.2)

    for j in range(i + 1, len(axes)):
        axes[j].set_visible(False)

    plt.tight_layout(rect=[0, 0, 1, 0.95])
    plt.show()


# --- Usage ---
visualizar_violines(df)

Q–Q plots to check each feature's distribution after "normalizing" the data

In [10]:
variables_trans = df.columns.to_list()
variables_trans.remove("Class")
In [10]:
transformer = PowerTransformer(method="yeo-johnson", standardize=False)
In [11]:
transformer.fit(df[variables_trans])
Out[11]:
PowerTransformer(standardize=False)
In [12]:
df[variables_trans] = transformer.transform(df[variables_trans])
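A caveat on the cells above: the transformer is fitted on the full DataFrame before any train/test split, which leaks test-set statistics into the transformation. A minimal sketch of the leakage-free pattern, on synthetic skewed data (the array and split sizes are stand-ins, not the project dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
X_demo = rng.lognormal(size=(200, 3))   # skewed synthetic features

X_tr, X_te = train_test_split(X_demo, test_size=0.3, random_state=42)

# Fit the transformation on the training split only,
# then apply the learned lambdas to the held-out split
pt = PowerTransformer(method="yeo-johnson", standardize=False)
X_tr_t = pt.fit_transform(X_tr)
X_te_t = pt.transform(X_te)

print(X_tr_t.shape, X_te_t.shape)
```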
In [13]:
n = len(variables_trans)

ncols = 8
nrows = math.ceil(n / ncols)

# ---------- FIGURE 1: HISTOGRAMS ----------
fig1, axes1 = plt.subplots(nrows, ncols, figsize=(ncols*2.0, nrows*1.8))  # compact plots
axes1 = axes1.flatten() if nrows*ncols > 1 else [axes1]

for i, col in enumerate(variables_trans):
    ax = axes1[i]
    data = df[col].dropna()
    ax.hist(data, bins=20, edgecolor='white')
    ax.set_title(col, fontsize=9)
    ax.tick_params(labelsize=8)

# turn off any leftover axes
for j in range(i+1, nrows*ncols):
    axes1[j].axis('off')

fig1.suptitle('Histogramas', fontsize=12, y=1.02)
fig1.tight_layout()
plt.show()

# ---------- FIGURE 2: Q-Q PLOTS ----------
fig2, axes2 = plt.subplots(nrows, ncols, figsize=(ncols*2.0, nrows*1.8))
axes2 = axes2.flatten() if nrows*ncols > 1 else [axes2]

for i, col in enumerate(variables_trans):
    ax = axes2[i]
    data = df[col].dropna()
    stats.probplot(data, dist="norm", plot=ax)
    ax.set_title(col, fontsize=9)
    ax.tick_params(labelsize=8)

for j in range(i+1, nrows*ncols):
    axes2[j].axis('off')

fig2.suptitle('Q-Q plots vs Normal', fontsize=12, y=1.02)
fig2.tight_layout()
plt.show()

Frequency tables for each categorical feature

In [14]:
for column in df.select_dtypes(include=['object', 'bool']).columns:
    display(column, pd.crosstab(index=df[column], columns='% observations', normalize='columns') * 100)
'Class'
col_0 % observations
Class
angry 25.0
happy 25.0
relax 25.0
sad 25.0
In [15]:
label_encoder = LabelEncoder()
df["Class"] = label_encoder.fit_transform(df['Class'])
df
Out[15]:
Class _RMSenergy_Mean _Lowenergy_Mean _Fluctuation_Mean _Tempo_Mean _MFCC_Mean_1 _MFCC_Mean_2 _MFCC_Mean_3 _MFCC_Mean_4 _MFCC_Mean_5 ... _Chromagram_Mean_9 _Chromagram_Mean_10 _Chromagram_Mean_11 _Chromagram_Mean_12 _HarmonicChangeDetectionFunction_Mean _HarmonicChangeDetectionFunction_Std _HarmonicChangeDetectionFunction_Slope _HarmonicChangeDetectionFunction_PeriodFreq _HarmonicChangeDetectionFunction_PeriodAmp _HarmonicChangeDetectionFunction_PeriodEntropy
0 2 0.046255 1.147712 0.785693 49.459481 1.883949 0.377859 0.865352 0.079494 0.219944 ... 0.269816 1.136109 0.007925 0.091497 0.537622 0.200511 0.017918 0.807322 6.762849 5.938439e+30
1 2 0.095623 0.727046 0.764910 52.943619 1.900475 0.545123 0.767637 0.433917 0.549911 ... 0.001995 1.136109 -0.000000 0.487793 0.460935 0.169693 -0.083691 1.931502 12.210196 5.010738e+30
2 2 0.041457 1.302833 0.793466 65.444342 1.512479 0.986259 0.494373 0.354595 0.285253 ... 0.147740 0.824692 0.015702 0.491675 0.821232 0.222127 0.129709 1.179722 11.586369 3.993609e+30
3 2 0.101274 1.185439 0.792824 29.467297 1.534865 1.775386 0.600987 0.380042 0.010997 ... 0.036190 1.136109 0.134976 0.424863 0.851187 0.202856 0.041560 0.319833 15.087478 5.302789e+30
4 2 0.056968 1.147712 0.789386 37.016335 1.656768 0.234033 0.795458 0.098256 0.430168 ... 0.003979 0.428451 0.447120 0.000999 0.615226 0.200511 0.087067 0.617198 10.533691 2.839124e+30
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
395 0 0.121233 1.107620 0.744632 58.198474 1.582646 0.065509 0.703238 0.046522 0.263501 ... 0.248283 0.935896 0.275238 0.110760 0.555823 0.120580 0.116536 1.658183 27.746435 5.611702e+30
396 0 0.122175 0.878151 0.740478 63.229758 1.517594 -0.145470 0.338298 -0.010970 0.028981 ... 0.019488 1.136109 0.359050 0.009898 0.345856 0.110819 0.140000 1.931502 29.365195 5.010738e+30
397 0 0.127222 1.044541 0.733938 50.607703 1.081312 0.600708 0.858658 -0.109992 0.242721 ... 0.048665 0.189290 0.213246 0.091497 0.423771 0.132968 0.108024 1.931502 22.030962 3.773116e+30
398 0 0.104013 1.092413 0.728113 44.628036 1.221377 -0.205021 0.680127 0.090942 0.205078 ... 0.115906 1.136109 0.222443 0.122376 0.442152 0.123533 0.060080 1.931502 21.187036 5.611702e+30
399 0 0.071178 0.817475 0.746010 55.607331 1.318279 -0.013976 0.814626 -0.020891 0.342518 ... 0.087529 1.136109 0.084520 0.031917 0.271693 0.097658 0.006988 0.541146 25.344224 4.473616e+30

400 rows × 51 columns

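LabelEncoder assigns integers to the alphabetically sorted class names, which is why the encoded labels show up as angry = 0, happy = 1, relax = 2, sad = 3. A small sketch on a toy label list (not the project data) showing how to recover the mapping:

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the Class column (hypothetical sample)
labels = ["angry", "happy", "relax", "sad", "happy"]

le = LabelEncoder()
encoded = le.fit_transform(labels)

# classes_ is sorted alphabetically, so the integer mapping is deterministic
mapping = dict(zip(le.classes_, le.transform(le.classes_)))
print(mapping)
print(le.inverse_transform([2, 1, 3, 0]))
```

Keeping this mapping (or the fitted encoder itself) around is what lets later confusion matrices be read back in terms of emotions.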
In [16]:
df.corr()
Out[16]:
Class _RMSenergy_Mean _Lowenergy_Mean _Fluctuation_Mean _Tempo_Mean _MFCC_Mean_1 _MFCC_Mean_2 _MFCC_Mean_3 _MFCC_Mean_4 _MFCC_Mean_5 ... _Chromagram_Mean_9 _Chromagram_Mean_10 _Chromagram_Mean_11 _Chromagram_Mean_12 _HarmonicChangeDetectionFunction_Mean _HarmonicChangeDetectionFunction_Std _HarmonicChangeDetectionFunction_Slope _HarmonicChangeDetectionFunction_PeriodFreq _HarmonicChangeDetectionFunction_PeriodAmp _HarmonicChangeDetectionFunction_PeriodEntropy
Class 1.000000 -0.300955 0.174327 0.392882 -0.056205 0.362134 0.068330 -0.118221 -0.050932 -0.045488 ... 0.028964 -0.111679 -0.054256 0.044204 0.314104 0.669975 0.019715 -0.201242 -0.597808 -0.063368
_RMSenergy_Mean -0.300955 1.000000 -0.281635 -0.177114 -0.014712 -0.171867 -0.025694 0.075546 0.004651 -0.049314 ... 0.128949 0.131721 0.163869 0.029282 -0.014093 -0.369574 -0.091747 0.120925 0.355068 -0.032966
_Lowenergy_Mean 0.174327 -0.281635 1.000000 0.155663 -0.042172 0.118139 0.133686 -0.067116 0.056634 0.018319 ... 0.074131 -0.025039 0.066540 -0.034081 0.023246 0.207735 0.223859 -0.088219 -0.196982 -0.021062
_Fluctuation_Mean 0.392882 -0.177114 0.155663 1.000000 -0.106790 0.124968 0.147555 -0.130878 0.120289 -0.055065 ... -0.028560 -0.083241 0.001803 0.060133 0.338422 0.452092 0.104434 -0.069151 -0.306285 -0.040611
_Tempo_Mean -0.056205 -0.014712 -0.042172 -0.106790 1.000000 -0.075108 0.083978 0.017875 0.025673 0.072101 ... 0.060861 0.059757 0.035827 0.015775 -0.089613 -0.137069 -0.058962 0.032283 0.059248 0.116209
_MFCC_Mean_1 0.362134 -0.171867 0.118139 0.124968 -0.075108 1.000000 0.040707 0.067129 0.049199 -0.107084 ... -0.177787 -0.094558 -0.177474 -0.060446 -0.004390 0.398240 0.020684 -0.126185 -0.431880 -0.072597
_MFCC_Mean_2 0.068330 -0.025694 0.133686 0.147555 0.083978 0.040707 1.000000 0.047537 0.358970 0.177046 ... 0.003020 0.009304 -0.117024 -0.025309 -0.041602 0.136278 0.147273 -0.052555 -0.165532 0.033975
_MFCC_Mean_3 -0.118221 0.075546 -0.067116 -0.130878 0.017875 0.067129 0.047537 1.000000 0.190250 0.104222 ... -0.079648 0.009654 -0.058884 -0.023069 -0.157477 -0.079437 0.076963 -0.040040 -0.026490 -0.058232
_MFCC_Mean_4 -0.050932 0.004651 0.056634 0.120289 0.025673 0.049199 0.358970 0.190250 1.000000 0.272867 ... -0.047560 0.062361 -0.090089 -0.025413 -0.075269 -0.004211 0.127470 -0.075101 -0.008205 -0.092063
_MFCC_Mean_5 -0.045488 -0.049314 0.018319 -0.055065 0.072101 -0.107084 0.177046 0.104222 0.272867 1.000000 ... 0.057164 0.005268 0.023767 0.030171 -0.085922 -0.104762 0.037293 0.027463 0.062554 0.013751
_MFCC_Mean_6 -0.037669 -0.038272 -0.006840 -0.112516 0.064635 0.063804 0.181029 0.121178 0.337789 0.379467 ... -0.018752 0.068534 -0.040221 -0.080736 -0.119552 -0.078788 -0.011430 -0.051585 0.040636 -0.067341
_MFCC_Mean_7 -0.044928 -0.073890 -0.007664 -0.204872 0.109782 -0.033569 0.057125 0.097992 0.055655 0.105031 ... 0.065463 0.078696 -0.025926 -0.092893 -0.142082 -0.126186 0.014191 -0.080202 0.030478 0.041139
_MFCC_Mean_8 0.029137 -0.028576 0.002668 -0.009350 0.084302 0.023452 0.006055 0.048145 -0.048309 -0.006843 ... 0.105767 0.045640 -0.016686 -0.089824 -0.054623 -0.065312 -0.082687 -0.004595 0.004550 0.004134
_MFCC_Mean_9 0.032591 -0.061882 0.139184 -0.019740 0.089347 0.018528 0.133188 -0.023552 0.040513 -0.043807 ... 0.011716 -0.005635 0.017750 -0.119475 -0.086143 0.017109 0.091577 -0.070629 -0.074413 -0.028158
_MFCC_Mean_10 -0.020621 -0.024011 0.061075 0.092923 0.089506 -0.045888 0.054751 0.006540 0.005815 0.044985 ... 0.000884 -0.037011 0.001084 -0.096384 -0.044083 0.041982 0.036067 -0.047205 -0.047720 0.012345
_MFCC_Mean_11 -0.075638 0.028033 -0.062946 -0.045076 0.066297 -0.160400 0.114932 -0.119775 -0.011914 0.098805 ... 0.014056 0.030343 0.056202 0.065739 0.003975 -0.022370 -0.059868 -0.009596 0.085333 0.069530
_MFCC_Mean_12 0.028299 -0.040536 0.060556 0.047515 0.022166 0.016254 0.005449 0.003575 -0.034378 -0.072253 ... -0.052725 -0.008822 0.024351 0.098350 0.098797 0.082483 0.013647 -0.019403 -0.011682 0.079298
_MFCC_Mean_13 -0.067013 -0.067257 0.103279 -0.047603 0.068089 0.050632 0.133568 0.031047 -0.002487 0.058814 ... -0.086168 0.039328 0.008362 0.132387 -0.026787 0.036666 -0.010579 0.050582 0.005560 0.100646
_Roughness_Mean -0.394106 0.922060 -0.354121 -0.247770 0.022865 -0.276713 -0.090417 0.080286 -0.045444 -0.019835 ... 0.127504 0.154507 0.183550 0.057615 0.003768 -0.459690 -0.097954 0.120594 0.473365 0.000949
_Roughness_Slope 0.144049 -0.101701 0.035621 0.050208 0.036830 0.084562 0.104972 -0.026281 0.062458 -0.005208 ... -0.047285 0.024319 -0.088668 0.103516 0.109817 0.220078 0.086695 -0.086702 -0.164752 0.029399
_Zero-crossingrate_Mean -0.074970 0.026139 -0.015567 0.069289 0.004459 -0.682747 -0.288590 -0.310513 -0.263036 -0.112658 ... 0.205742 0.040806 0.255343 0.179463 0.235178 -0.174276 -0.073436 0.086946 0.331018 0.064745
_AttackTime_Mean 0.480993 -0.444078 0.459084 0.410047 -0.092952 0.192544 0.053319 -0.122228 -0.092701 -0.158953 ... -0.051039 -0.101806 -0.091793 -0.059036 0.079447 0.500102 0.153832 -0.122243 -0.504521 -0.095227
_AttackTime_Slope -0.110661 0.080719 -0.040061 -0.041544 -0.160343 0.003024 -0.204739 -0.010913 -0.025485 0.057418 ... 0.043832 -0.060792 0.096816 0.071297 0.020882 -0.094326 -0.090617 0.089473 0.112861 0.003476
_Rolloff_Mean -0.156709 0.164842 0.040105 0.110152 0.073996 -0.620981 0.278365 -0.251943 0.125691 0.008531 ... 0.199541 0.136537 0.173243 0.109965 0.100547 -0.173244 -0.002296 0.070895 0.257103 0.038293
_Eventdensity_Mean -0.484784 0.450370 -0.490128 -0.275265 0.101036 -0.448766 -0.205635 -0.048297 -0.063492 0.014219 ... 0.168344 0.100157 0.200335 0.065604 0.052560 -0.630868 -0.107198 0.262463 0.681950 0.128439
_Pulseclarity_Mean -0.372102 0.189068 0.011201 0.033147 0.117209 -0.309492 -0.066375 -0.173008 -0.042196 -0.046745 ... 0.176810 0.057012 0.181230 0.070518 0.030180 -0.392775 0.032244 0.156389 0.467441 0.151976
_Brightness_Mean -0.224204 0.118511 -0.069372 0.003776 0.045190 -0.896142 -0.188497 -0.167434 -0.135427 -0.035968 ... 0.193288 0.080318 0.211502 0.126747 0.127376 -0.279805 -0.044780 0.109403 0.377307 0.051659
_Spectralcentroid_Mean -0.167885 0.154831 0.008884 0.088264 0.065080 -0.739102 0.141171 -0.287106 0.028881 -0.023646 ... 0.208623 0.116180 0.191573 0.125356 0.118459 -0.206763 -0.011408 0.085939 0.304879 0.038277
_Spectralspread_Mean -0.122575 0.197366 0.054832 0.098471 0.067792 -0.437120 0.373816 -0.221671 0.192448 -0.017684 ... 0.173048 0.151266 0.135917 0.102115 0.063392 -0.137951 0.018140 0.041517 0.201810 0.008683
_Spectralskewness_Mean 0.201855 -0.144788 0.012160 -0.083325 -0.090589 0.785463 -0.227533 0.233059 -0.099461 -0.055843 ... -0.221493 -0.108266 -0.161126 -0.090343 -0.099755 0.217282 -0.009631 -0.084267 -0.304221 -0.062772
_Spectralkurtosis_Mean 0.178580 -0.160859 -0.013604 -0.120152 -0.092898 0.702285 -0.303576 0.250561 -0.159243 -0.040663 ... -0.215957 -0.121465 -0.155683 -0.099912 -0.105316 0.180356 -0.011200 -0.066685 -0.272795 -0.054307
_Spectralflatness_Mean -0.058126 0.063555 0.089920 0.043057 0.048616 -0.304261 0.299231 -0.193467 0.085180 -0.059916 ... 0.084293 0.077582 0.068592 0.053399 -0.026120 -0.089562 0.026035 0.044578 0.086967 -0.000055
_EntropyofSpectrum_Mean -0.219075 0.170018 -0.042593 0.033299 0.057302 -0.774152 -0.081432 -0.295714 -0.096215 -0.043524 ... 0.243881 0.105560 0.265868 0.175020 0.173405 -0.307949 -0.052169 0.113522 0.433486 0.044032
_Chromagram_Mean_1 -0.032565 0.035289 -0.002613 0.094381 0.060964 -0.064125 -0.058609 -0.134867 -0.054657 0.084882 ... 0.093008 -0.178847 0.353609 0.349807 0.257996 -0.038977 -0.115735 0.051425 0.221190 0.001559
_Chromagram_Mean_2 -0.033956 0.071248 -0.042269 0.011150 0.031261 -0.121695 -0.140485 -0.100036 -0.130305 0.025397 ... 0.444471 -0.090557 0.256742 0.316175 0.310582 -0.019188 -0.099637 -0.013200 0.238698 0.060834
_Chromagram_Mean_3 -0.117814 0.096670 -0.035443 -0.010909 0.011636 -0.010620 -0.050730 -0.074067 -0.012424 0.020332 ... -0.073335 0.108335 0.165910 -0.094055 0.070674 -0.084338 -0.111410 0.014367 0.136154 0.037803
_Chromagram_Mean_4 -0.035852 0.092103 -0.004291 0.014159 -0.049349 -0.092652 -0.087213 -0.116014 -0.129973 -0.025885 ... 0.409477 -0.195227 0.430896 0.156868 0.257533 -0.059484 -0.090352 -0.013615 0.232225 -0.029288
_Chromagram_Mean_5 0.031225 0.019026 -0.101486 0.038371 0.052184 -0.022024 -0.099362 -0.070167 -0.092000 -0.078961 ... -0.005299 0.103886 -0.125068 0.468105 0.117932 0.022755 -0.031876 0.016427 0.057214 -0.005967
_Chromagram_Mean_6 0.069666 -0.017463 0.059919 0.060916 0.007653 -0.064415 -0.146435 -0.045886 -0.133151 0.014098 ... 0.228630 -0.082557 0.397089 -0.091050 0.245944 -0.004566 0.011835 0.074252 0.126667 -0.078459
_Chromagram_Mean_7 0.252600 -0.084324 0.034529 0.207115 -0.022048 -0.015167 -0.041778 -0.086944 -0.090960 -0.028213 ... 0.267587 -0.134059 0.047911 0.492224 0.359056 0.174767 -0.033643 -0.033668 -0.002329 -0.018215
_Chromagram_Mean_8 0.136248 0.008533 0.031312 0.169993 0.015751 -0.067079 0.059728 -0.048600 0.035155 -0.019915 ... 0.295011 -0.065163 0.278067 -0.002241 0.314400 0.058605 -0.058629 0.077626 0.097986 -0.042300
_Chromagram_Mean_9 0.028964 0.128949 0.074131 -0.028560 0.060861 -0.177787 0.003020 -0.079648 -0.047560 0.057164 ... 1.000000 0.062536 0.368510 0.136012 0.285409 -0.104406 -0.051001 0.068966 0.254790 0.048431
_Chromagram_Mean_10 -0.111679 0.131721 -0.025039 -0.083241 0.059757 -0.094558 0.009304 0.009654 0.062361 0.005268 ... 0.062536 1.000000 0.064108 -0.046829 -0.120855 -0.144962 0.102380 0.000520 0.077126 0.007735
_Chromagram_Mean_11 -0.054256 0.163869 0.066540 0.001803 0.035827 -0.177474 -0.117024 -0.058884 -0.090089 0.023767 ... 0.368510 0.064108 1.000000 0.074993 0.227201 -0.161698 -0.091722 0.008532 0.303412 -0.013759
_Chromagram_Mean_12 0.044204 0.029282 -0.034081 0.060133 0.015775 -0.060446 -0.025309 -0.023069 -0.025413 0.030171 ... 0.136012 -0.046829 0.074993 1.000000 0.190464 0.047696 -0.050764 -0.001428 0.090829 -0.022490
_HarmonicChangeDetectionFunction_Mean 0.314104 -0.014093 0.023246 0.338422 -0.089613 -0.004390 -0.041602 -0.157477 -0.075269 -0.085922 ... 0.285409 -0.120855 0.227201 0.190464 1.000000 0.427895 -0.138121 0.046846 0.122117 0.075975
_HarmonicChangeDetectionFunction_Std 0.669975 -0.369574 0.207735 0.452092 -0.137069 0.398240 0.136278 -0.079437 -0.004211 -0.104762 ... -0.104406 -0.144962 -0.161698 0.047696 0.427895 1.000000 0.027965 -0.260338 -0.752416 -0.130781
_HarmonicChangeDetectionFunction_Slope 0.019715 -0.091747 0.223859 0.104434 -0.058962 0.020684 0.147273 0.076963 0.127470 0.037293 ... -0.051001 0.102380 -0.091722 -0.050764 -0.138121 0.027965 1.000000 -0.056598 -0.105679 -0.154811
_HarmonicChangeDetectionFunction_PeriodFreq -0.201242 0.120925 -0.088219 -0.069151 0.032283 -0.126185 -0.052555 -0.040040 -0.075101 0.027463 ... 0.068966 0.000520 0.008532 -0.001428 0.046846 -0.260338 -0.056598 1.000000 0.363481 0.077017
_HarmonicChangeDetectionFunction_PeriodAmp -0.597808 0.355068 -0.196982 -0.306285 0.059248 -0.431880 -0.165532 -0.026490 -0.008205 0.062554 ... 0.254790 0.077126 0.303412 0.090829 0.122117 -0.752416 -0.105679 0.363481 1.000000 0.074684
_HarmonicChangeDetectionFunction_PeriodEntropy -0.063368 -0.032966 -0.021062 -0.040611 0.116209 -0.072597 0.033975 -0.058232 -0.092063 0.013751 ... 0.048431 0.007735 -0.013759 -0.022490 0.075975 -0.130781 -0.154811 0.077017 0.074684 1.000000

51 rows × 51 columns

Heatmap quantifying the correlation between the numeric variables in the DataFrame

In [17]:
data_df_numeric = df.select_dtypes(include=[np.number])

data_df_corr = data_df_numeric.corr()

plt.figure(figsize=(18,12))
sns.heatmap(data_df_corr, annot=True, linewidths=0.5)
plt.show()

Feature engineering¶

Per-class distributions (median and spread) of the first five numeric features, via boxplots

In [18]:
if 'Class' in df.columns:
    numeric_cols = df.select_dtypes(include=[float, int]).columns
    numeric_cols = numeric_cols.drop('Class', errors='ignore')
    for col in numeric_cols[:5]:
        plt.figure(figsize=(6, 3))
        sns.boxplot(x='Class', y=col, data=df)
        plt.title(f"Boxplot de {col} por clase")
        plt.tight_layout()
        plt.show()
else:
    print("La columna 'Class' no existe en el DataFrame.")

Density plots (KDE) of the first five numeric features, by class

In [19]:
for col in numeric_cols[:5]:
    plt.figure(figsize=(5,3))
    sns.kdeplot(data=df, x=col, hue="Class", fill=True)
    plt.title(f"Densidad de {col}")
    plt.tight_layout()
    plt.show()


Outlier detection¶

Z-score: flag values with z > 3 or z < −3:

In [20]:
z_scores = np.abs(stats.zscore(df[numeric_cols], nan_policy='omit'))
outlier_mask = (z_scores > 3)
outlier_counts = outlier_mask.sum(axis=0)
outlier_counts_series = pd.Series(outlier_counts, index=numeric_cols)
outlier_counts_series.sort_values(ascending=False).head(10)
Out[20]:
_MFCC_Mean_6                                      8
_MFCC_Mean_8                                      6
_MFCC_Mean_10                                     5
_Lowenergy_Mean                                   4
_MFCC_Mean_2                                      4
_HarmonicChangeDetectionFunction_PeriodEntropy    4
_MFCC_Mean_4                                      3
_MFCC_Mean_5                                      3
_MFCC_Mean_3                                      3
_MFCC_Mean_11                                     3
dtype: int64

IQR (Interquartile Range)

In [21]:
Q1 = df[numeric_cols].quantile(0.25)
Q3 = df[numeric_cols].quantile(0.75)
IQR = Q3 - Q1

is_outlier = ((df[numeric_cols] < (Q1 - 1.5 * IQR)) |
              (df[numeric_cols] > (Q3 + 1.5 * IQR)))
# count of outliers per column
is_outlier.sum().sort_values(ascending=False).head(10)
Out[21]:
_MFCC_Mean_6                                      17
_Fluctuation_Mean                                 16
_HarmonicChangeDetectionFunction_PeriodEntropy    15
_MFCC_Mean_13                                     14
_MFCC_Mean_10                                     13
_MFCC_Mean_1                                      13
_AttackTime_Mean                                  13
_MFCC_Mean_8                                      12
_MFCC_Mean_9                                      11
_MFCC_Mean_5                                      11
dtype: int64
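Counting outliers is only the diagnostic step; one common treatment, if rows are not to be dropped, is capping (winsorizing) values at the IQR fences. A sketch on a synthetic column, not the project features:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
s = pd.Series(rng.normal(0, 1, 500))
s.iloc[:3] = [15.0, -12.0, 20.0]   # inject extreme values

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Cap values at the fences instead of dropping rows
capped = s.clip(lower=lower, upper=upper)
print(s.max(), "->", capped.max())
```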

Limpieza y tratamiento de datos¶

Ensure that the Class column contains only the expected labels, and that the numeric columns are free of inf, NaN, and other artifacts

In [22]:
print(df["Class"].unique())
[2 1 3 0]

Check for duplicates

In [23]:
dup_count = df.duplicated().sum()
print("Duplicados:", dup_count)
if dup_count > 0:
    df = df.drop_duplicates()
Duplicados: 12

Remove constant columns (zero standard deviation)

In [24]:
stds = df[numeric_cols].std()
zero_std = stds[stds == 0].index.tolist()
print("Columnas constantes:", zero_std)
df = df.drop(columns=zero_std)
Columnas constantes: []

Version the cleaned dataset

In [25]:
# to_csv returns None, so its result should not be assigned; note also that df has
# already been transformed at this point, so only the cleaned version is saved here
df.to_csv("turkish_music_emotion_cleaned.csv", index=False)
In [26]:
X = df.loc[:, df.columns != "Class"]
Y = df.loc[:, df.columns == "Class"]
X
Out[26]:
_RMSenergy_Mean _Lowenergy_Mean _Fluctuation_Mean _Tempo_Mean _MFCC_Mean_1 _MFCC_Mean_2 _MFCC_Mean_3 _MFCC_Mean_4 _MFCC_Mean_5 _MFCC_Mean_6 ... _Chromagram_Mean_9 _Chromagram_Mean_10 _Chromagram_Mean_11 _Chromagram_Mean_12 _HarmonicChangeDetectionFunction_Mean _HarmonicChangeDetectionFunction_Std _HarmonicChangeDetectionFunction_Slope _HarmonicChangeDetectionFunction_PeriodFreq _HarmonicChangeDetectionFunction_PeriodAmp _HarmonicChangeDetectionFunction_PeriodEntropy
0 0.046255 1.147712 0.785693 49.459481 1.883949 0.377859 0.865352 0.079494 0.219944 0.120032 ... 0.269816 1.136109 0.007925 0.091497 0.537622 0.200511 0.017918 0.807322 6.762849 5.938439e+30
1 0.095623 0.727046 0.764910 52.943619 1.900475 0.545123 0.767637 0.433917 0.549911 0.881112 ... 0.001995 1.136109 -0.000000 0.487793 0.460935 0.169693 -0.083691 1.931502 12.210196 5.010738e+30
2 0.041457 1.302833 0.793466 65.444342 1.512479 0.986259 0.494373 0.354595 0.285253 0.142846 ... 0.147740 0.824692 0.015702 0.491675 0.821232 0.222127 0.129709 1.179722 11.586369 3.993609e+30
3 0.101274 1.185439 0.792824 29.467297 1.534865 1.775386 0.600987 0.380042 0.010997 0.145968 ... 0.036190 1.136109 0.134976 0.424863 0.851187 0.202856 0.041560 0.319833 15.087478 5.302789e+30
4 0.056968 1.147712 0.789386 37.016335 1.656768 0.234033 0.795458 0.098256 0.430168 0.296447 ... 0.003979 0.428451 0.447120 0.000999 0.615226 0.200511 0.087067 0.617198 10.533691 2.839124e+30
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
395 0.121233 1.107620 0.744632 58.198474 1.582646 0.065509 0.703238 0.046522 0.263501 0.105583 ... 0.248283 0.935896 0.275238 0.110760 0.555823 0.120580 0.116536 1.658183 27.746435 5.611702e+30
396 0.122175 0.878151 0.740478 63.229758 1.517594 -0.145470 0.338298 -0.010970 0.028981 0.039226 ... 0.019488 1.136109 0.359050 0.009898 0.345856 0.110819 0.140000 1.931502 29.365195 5.010738e+30
397 0.127222 1.044541 0.733938 50.607703 1.081312 0.600708 0.858658 -0.109992 0.242721 0.220548 ... 0.048665 0.189290 0.213246 0.091497 0.423771 0.132968 0.108024 1.931502 22.030962 3.773116e+30
398 0.104013 1.092413 0.728113 44.628036 1.221377 -0.205021 0.680127 0.090942 0.205078 0.062568 ... 0.115906 1.136109 0.222443 0.122376 0.442152 0.123533 0.060080 1.931502 21.187036 5.611702e+30
399 0.071178 0.817475 0.746010 55.607331 1.318279 -0.013976 0.814626 -0.020891 0.342518 0.276010 ... 0.087529 1.136109 0.084520 0.031917 0.271693 0.097658 0.006988 0.541146 25.344224 4.473616e+30

388 rows × 50 columns

In [27]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=42)
In [28]:
standardscaler = StandardScaler()
x_train = standardscaler.fit_transform(X_train)
x_test = standardscaler.transform(X_test)

Logistic Regression

In [29]:
logregression = LogisticRegression(random_state=0)
logregression.fit(x_train, Y_train.values.ravel())
y_pred = logregression.predict(x_test)
print(y_pred)
print(Y_test)
[3 2 1 2 0 1 1 3 1 1 0 2 1 3 3 1 2 3 3 3 0 0 2 1 0 3 0 0 2 0 3 2 3 0 0 2 3
 0 3 1 1 1 2 1 2 3 3 0 3 3 0 2 3 2 3 1 3 2 2 2 1 3 0 3 0 2 2 0 0 1 0 2 1 2
 2 2 1 0 2 3 1 1 0 3 1 3 3 1 3 0 2 0 2 3 0 2 2 0 2 2 3 0 0 2 1 3 0 0 2 3 0
 1 3 1 1 3 0]
     Class
279      3
46       2
172      1
42       2
359      0
..     ...
252      3
216      3
113      1
17       2
301      0

[117 rows x 1 columns]
In [30]:
accuracy = accuracy_score(Y_test,y_pred)
print(accuracy)
0.7777777777777778
In [31]:
cm = confusion_matrix(Y_test,y_pred)
cm
Out[31]:
array([[26,  0,  1,  2],
       [ 2, 23,  1,  2],
       [ 1,  0, 25, 11],
       [ 1,  2,  3, 17]])

KNN Classification

In [32]:
model = KNeighborsClassifier()
model.fit(x_train, Y_train.values.ravel())
y_pred = model.predict(x_test)
print(accuracy_score(Y_test,y_pred))
0.6495726495726496
In [33]:
cm = confusion_matrix(Y_test,y_pred)
cm
Out[33]:
array([[26,  2,  0,  1],
       [ 1, 27,  0,  0],
       [ 7,  4, 15, 11],
       [ 6,  4,  5,  8]])
In [34]:
model = KNeighborsClassifier(metric="manhattan")
model.fit(x_train, Y_train.values.ravel())
y_pred = model.predict(x_test)
print(accuracy_score(Y_test,y_pred))
0.6581196581196581
In [35]:
cm = confusion_matrix(Y_test,y_pred)
cm
Out[35]:
array([[26,  1,  0,  2],
       [ 1, 27,  0,  0],
       [ 6,  2, 14, 15],
       [ 6,  3,  4, 10]])

Support Vector Machine

In [36]:
svc = SVC(kernel='linear')
svc.fit(x_train,Y_train.values.ravel())
y_pred = svc.predict(x_test)
print(accuracy_score(Y_test,y_pred))
0.7264957264957265
In [37]:
cm = confusion_matrix(Y_test,y_pred)
cm
Out[37]:
array([[28,  0,  1,  0],
       [ 0, 24,  1,  3],
       [ 4,  0, 20, 13],
       [ 3,  3,  4, 13]])
In [38]:
svc = SVC(kernel='rbf')
svc.fit(x_train,Y_train.values.ravel())
y_pred = svc.predict(x_test)
print(accuracy_score(Y_test,y_pred))
0.7521367521367521
In [39]:
cm = confusion_matrix(Y_test,y_pred)
cm
Out[39]:
array([[23,  1,  0,  5],
       [ 0, 25,  1,  2],
       [ 0,  0, 25, 12],
       [ 1,  2,  5, 15]])

Gaussian Naive Bayes

In [40]:
gnb = GaussianNB()
gnb.fit(x_train,Y_train.values.ravel())
y_pred = gnb.predict(x_test)
print(accuracy_score(Y_test,y_pred))
0.7863247863247863
In [41]:
cm = confusion_matrix(Y_test,y_pred)
cm
Out[41]:
array([[23,  2,  1,  3],
       [ 0, 25,  0,  3],
       [ 0,  1, 29,  7],
       [ 0,  3,  5, 15]])

Decision Tree

In [42]:
dtc = DecisionTreeClassifier(criterion="entropy")
dtc.fit(x_train,Y_train.values.ravel())
y_pred = dtc.predict(x_test)
print(accuracy_score(Y_test,y_pred))
0.6752136752136753
In [43]:
cm = confusion_matrix(Y_test,y_pred)
cm
Out[43]:
array([[24,  1,  0,  4],
       [ 2, 22,  2,  2],
       [ 2,  3, 21, 11],
       [ 3,  4,  4, 12]])

Random Forest

In [44]:
rfc = RandomForestClassifier()
rfc.fit(x_train,Y_train.values.ravel())
y_pred = rfc.predict(x_test)
print(accuracy_score(Y_test,y_pred))
0.8205128205128205
In [45]:
cm = confusion_matrix(Y_test,y_pred)
cm
Out[45]:
array([[26,  0,  0,  3],
       [ 1, 26,  0,  1],
       [ 0,  1, 26, 10],
       [ 1,  2,  2, 18]])
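The accuracies above come from a single 70/30 split and can shift noticeably with the random seed. A hedged sketch of a more stable comparison using 5-fold cross-validation; the data here are synthetic stand-ins with roughly the project's shape (400 samples, 50 features, 4 classes), not the actual features:

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Synthetic 4-class problem standing in for the real feature matrix
X_demo, y_demo = make_classification(n_samples=400, n_features=50,
                                     n_informative=20, n_classes=4,
                                     random_state=42)

models = {
    "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "rf": RandomForestClassifier(random_state=42),
}
# Mean accuracy over 5 folds, one number per model family
scores = {name: cross_val_score(m, X_demo, y_demo, cv=5).mean()
          for name, m in models.items()}
print(scores)
```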

Dimensionality Reduction

In [46]:
pca = PCA(n_components=8)
pca.fit(X)
x_pca = pca.transform(X)
transformed = pd.DataFrame(x_pca)
X=transformed
X
Out[46]:
0 1 2 3 4 5 6 7
0 9.003031e+29 46774.809364 -926.896269 -439.107779 -37.528514 -7.861490 -7.855499 -0.930623
1 -2.739867e+28 -15396.849296 -2013.061404 -762.800698 -55.460648 2.776469 -4.564301 5.838971
2 -1.044527e+30 -49533.852416 -3042.806624 -370.528999 33.938369 13.407332 -5.777505 16.406478
3 2.646531e+29 24525.715202 -4031.572854 -763.691484 40.897985 5.702465 8.370228 -16.578039
4 -2.199013e+30 -150150.116064 -1882.520731 -598.871764 -56.325918 11.225291 2.046949 -7.212007
... ... ... ... ... ... ... ... ...
383 5.735657e+29 41906.372599 -2568.793272 -679.189570 -31.280710 2.245369 5.644715 13.338374
384 -2.739867e+28 18278.013476 -2670.222907 -310.077079 64.282401 -18.758019 1.193700 16.190012
385 -1.265020e+30 -77168.512531 -545.411112 -79.530364 -8.676784 14.326846 2.173908 5.732842
386 5.735657e+29 42590.101548 -1501.956360 -439.647642 -44.886751 7.903570 4.653567 -1.812959
387 -5.645205e+29 -42199.305661 -1706.781605 -477.377457 -11.777035 5.836857 6.503257 12.737869

388 rows × 8 columns

In [47]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=42)
In [48]:
x_train = standardscaler.fit_transform(X_train)
x_test = standardscaler.transform(X_test)
In [49]:
classifier = KNeighborsClassifier()
classifier.fit(x_train, Y_train.values.ravel())
Out[49]:
KNeighborsClassifier()
In [50]:
y_pred = classifier.predict(x_test)
print(accuracy_score(Y_test,y_pred))
0.48717948717948717
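The drop to roughly 0.49 accuracy is expected: PCA was fitted on the unscaled feature matrix, where `_HarmonicChangeDetectionFunction_PeriodEntropy` (on the order of 1e30) dominates the first component. A sketch of the effect on synthetic data, and the usual fix of scaling before PCA inside a pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, _ = make_classification(n_samples=200, n_features=10, random_state=0)
X_demo[:, 0] *= 1e30   # one feature on a wildly different scale, as in this dataset

# Without scaling, PC1 is essentially just the huge feature
pca_raw = PCA(n_components=2).fit(X_demo)
# With scaling, explained variance is shared across features
pca_scaled = make_pipeline(StandardScaler(), PCA(n_components=2)).fit(X_demo)

print(pca_raw.explained_variance_ratio_[0])                        # ~1.0
print(pca_scaled.named_steps['pca'].explained_variance_ratio_[0])  # much smaller
```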

Hyperparameter Optimization

In [ ]:
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(solver='lbfgs', max_iter=1000))
])

param_grid = {'classifier__C': [0.001, 0.01, 0.1, 0.15, 0.2, 0.25, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 10, 100]}

grid_search = GridSearchCV(pipeline, param_grid, cv=10, scoring='accuracy')
# The pipeline already standardizes, so fit it on the unscaled training split;
# pre-scaling X_train outside the pipeline would standardize the data twice.
grid_search.fit(X_train, Y_train.values.ravel())

# Get the best hyperparameters and accuracy
best_C = grid_search.best_params_['classifier__C']
best_accuracy = grid_search.best_score_
print("Best C:", best_C)
print("Best Accuracy:", best_accuracy)

Model Evaluation

In [ ]:
 
In [ ]:
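The evaluation cells above are left empty; a minimal sketch of what they might contain, a per-class report and confusion matrix for the strongest model so far (Random Forest). The data are synthetic stand-ins with the project's rough shape, not the real features:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

X_demo, y_demo = make_classification(n_samples=400, n_features=50,
                                     n_informative=20, n_classes=4,
                                     random_state=42)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3,
                                      random_state=42, stratify=y_demo)

model = RandomForestClassifier(random_state=42).fit(Xtr, ytr)
pred = model.predict(Xte)

# Precision/recall/F1 per class, then the raw confusion matrix
print(classification_report(yte, pred, digits=3))
print(confusion_matrix(yte, pred))
```

With the real dataset, passing `target_names=label_encoder.classes_` to `classification_report` would label each row with the emotion name instead of its integer code.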